arXiv: 1506.06833v2 [cs.CL] 19 Aug 2015 


A Survey of Current Datasets for Vision and Language Research 

Francis Ferraro^1*, Nasrin Mostafazadeh^2*, Ting-Hao (Kenneth) Huang^3,
Lucy Vanderwende^4, Jacob Devlin^4, Michel Galley^4, Margaret Mitchell^4

Microsoft Research 

1 Johns Hopkins University, 2 University of Rochester, 3 Carnegie Mellon University, 
4 Corresponding authors: {lucyv,jdevlin,mgalley,memitc}@microsoft.com 


Abstract 

Integrating vision and language has long been a dream in work on
artificial intelligence (AI). In the past two years, we have witnessed
an explosion of work that brings together vision and language from
images to videos and beyond. The available corpora have played a
crucial role in advancing this area of research. In this paper, we
propose a set of quality metrics for evaluating and analyzing the
vision & language datasets and categorize them accordingly. Our
analyses show that the most recent datasets have been using more
complex language and more abstract concepts; however, there are
different strengths and weaknesses in each.

1 Introduction 

Bringing together language and vision in one intelligent system has
long been an ambition in AI research, beginning with SHRDLU as one of
the first vision-language integration systems (Winograd, 1972) and
continuing with more recent attempts on conversational robots grounded
in the visual world (Kollar et al., 2013; Cantrell et al., 2010;
Matuszek et al., 2012; Kruijff et al., 2007; Roy et al., 2003). In the
past few years, an influx of new, large vision & language corpora,
alongside dramatic advances in vision research, has sparked renewed
interest in connecting vision and language. Vision & language corpora
now provide alignments between visual content that can be recognized
with Computer Vision (CV) algorithms and language that can be
understood and generated using Natural Language Processing techniques.

Fueled in part by the newly emerging data, research that blends
techniques in vision and in language has increased at an incredible
rate. In just the past year, recent work has proposed methods for
image and video captioning (Fang et al., 2014; Donahue et al., 2014;
Venugopalan et al., 2015), summarization (Kim et al., 2015), reference
(Kazemzadeh et al., 2014), and question answering (Antol et al., 2015;
Gao et al., 2015), to name just a few. The newly crafted large-scale
vision & language datasets have played a crucial role in defining this
research, serving as a foundation for training/testing and helping to
set benchmarks for measuring system performance.

Crowdsourcing and large image collections such as those provided by
Flickr^1 have made it possible for researchers to propose methods for
vision and language tasks alongside an accompanying dataset. However,
as more and more datasets have emerged in this space, it has become
unclear how different methods generalize beyond the datasets they are
evaluated on, and what data may be useful for moving the field beyond
a single task, towards solving larger AI problems.

In this paper, we take a step back to document this moment in time,
making a record of the major available corpora that are driving the
field. We provide a quantitative analysis of each of these corpora in
order to understand the characteristics of each, and how they compare
to one another. The quality of a dataset must be measured and compared
to related datasets, as low quality data may distort an entire
subfield. We propose a set of criteria for analyzing, evaluating and
comparing the quality of vision & language datasets against each
other. Knowing the details of a dataset compared to similar datasets
allows researchers to define more precisely what task(s) they are
trying to solve, and select the dataset(s) best suited to their goals,
while being aware of the implications and biases the datasets could
impose on a task.

We categorize the available datasets into three major classes and
evaluate them against these criteria. The datasets we present here
were chosen because they are all available to the community and cover
the data that has been created to support the recent focus on image
captioning work. More importantly, we provide an evolving website^2
containing pointers and references to many more vision-to-language
datasets, which we believe will be valuable in unifying the quickly
expanding research tasks in language and vision.

^1 http://www.flickr.com

2 Quality Criteria for Language & Vision Datasets

The quality of a dataset is highly dependent on the sampling and
scraping techniques used early in the data collection process.
However, the content of datasets can play a major role in narrowing
the focus of the field. Datasets are affected both by reporting bias
(Gordon and Van Durme, 2013), where the frequency with which people
write about actions, events, or states does not directly reflect
real-world frequencies of those phenomena, and by photographer’s bias
(Torralba and Efros, 2011), where photographs are somewhat predictable
within a given domain. This suggests that new datasets may be useful
towards the larger AI goal if provided alongside a set of quantitative
metrics that show how they compare against similar corpora, as well as
more general “background” corpora. Such metrics can be used as
indicators of dataset bias and language richness. At a higher level,
we argue that clearly defined metrics are necessary to provide
quantitative measurements of how a new dataset compares to previous
work. This helps clarify and benchmark how research is progressing
towards a broader AI goal as more and more data comes into play.

In this section, we propose a set of such metrics that characterize
vision & language datasets. We focus on methods to measure language
quality that can be used across several corpora. We also briefly
examine metrics for vision quality. We evaluate several recent
datasets based on all proposed metrics in Section 4, with results
reported in Tables 1, 2, and Figure 1.

2.1 Language Quality

We define the following criteria for evaluating the captions or
instructions of the datasets:

• Vocabulary Size (#vocab), the number of unique vocabulary words.

^2 http://visionandlanguage.net


• Syntactic Complexity (Frazier, Yngve) measures the amount of
embedding/branching in a sentence’s syntax. We report mean Yngve
(Yngve, 1960) and Frazier measurements (Frazier, 1985); each provides
a different counting on the number of nodes in the phrase markers of
syntactic trees.

• Part of Speech Distribution measures the distribution of nouns,
verbs, adjectives, and other parts of speech.

• Abstract:Concrete Ratio (#Conc, #Abs, %Abs) indicates the range of
visual and non-visual concepts the dataset covers. Abstract terms are
ideas or concepts, such as ‘love’ or ‘think’, and concrete terms are
all the objects or events that are mainly available to the senses. For
this purpose, we use a list of the most common abstract terms in
English (Vanderwende et al., 2015), and define concrete terms as all
other words except for a small set of function words.

• Average Sentence Length (Sent Len.) shows how rich and descriptive
the sentences are.

• Perplexity provides a measure of data skew by measuring how expected
sentences are from one corpus according to a model trained on another
corpus. We analyze perplexity (Ppl) for each dataset against a 5-gram
language model learned on a generic 30B-word English dataset. We
further analyze pair-wise perplexity of datasets against each other in
Section 4.
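
As an illustration of the simpler lexical criteria above, the
following Python sketch computes vocabulary size, average sentence
length, and the abstract and concrete counts for a tokenized corpus.
It is a minimal sketch and not the pipeline used for this survey: the
function name language_metrics, the toy word lists, and the choice to
count abstract and concrete terms over word types are assumptions made
for illustration. The type-level reading is at least consistent with
Table 1, where for Brown 7264 / (40411 + 7264) is approximately 15.24%.

```python
def language_metrics(sentences, abstract_terms, function_words):
    """Compute simple lexical statistics for a tokenized corpus.

    sentences      : list of token lists, one list per sentence
    abstract_terms : set of lowercase abstract words (hypothetical list)
    function_words : set of lowercase function words to exclude
    """
    # Vocabulary = distinct lowercased word types in the corpus.
    types = {tok.lower() for sent in sentences for tok in sent}

    # Concrete terms are assumed to be every non-function word type
    # that is not in the abstract list.
    content_types = types - function_words
    n_abstract = len(content_types & abstract_terms)
    n_concrete = len(content_types) - n_abstract
    n_content = n_abstract + n_concrete

    return {
        "vocab_size": len(types),
        "mean_sent_len": sum(len(s) for s in sentences) / len(sentences),
        "n_concrete": n_concrete,
        "n_abstract": n_abstract,
        "pct_abstract": 100.0 * n_abstract / n_content if n_content else 0.0,
    }


if __name__ == "__main__":
    corpus = [["A", "dog", "runs"], ["I", "love", "the", "dog"]]
    stats = language_metrics(corpus, {"love", "think"}, {"a", "the", "i"})
    for name, value in stats.items():
        print(name, value)
```

Counting over types rather than tokens keeps the abstract percentage
from being dominated by a few very frequent words; a token-level
variant would only require replacing the sets with counters.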

2.2 Vision Quality 

Our focus in this survey is mainly on language; however, the
characteristics of images or videos and their corresponding
annotations are equally important in vision & language research. The
quality of vision in a dataset can be characterized in part by the
variety of visual subjects and scenes provided, as well as the
richness of the annotations (e.g., segmentation using bounding boxes
(BB) or visual dependencies between boxes). Moreover, a vision corpus
can use abstract or real images (Abs/Real).

3 The Available Datasets 

We group a representative set of available datasets 
based on their content. For a complete list of 
datasets and their descriptions, please refer to the 
supplementary website.^2

3.1 Captioned Images 

Several recent vision & language datasets provide one or multiple
captions per image. The captions of these datasets are either the
original photo title and descriptions provided by online users
(Ordonez et al., 2011; Thomee et al., 2015), or the captions generated
by crowd workers for existing images. The former datasets tend to be
larger in size and contain more contextual descriptions.

3.1.1 User-generated Captions 

• SBU Captioned Photo Dataset (Ordonez et al., 2011) contains 1
million images with original user-generated captions, collected in the
wild by systematic querying of Flickr. The dataset was collected by
querying Flickr for specific terms such as objects and actions, and
the resulting images were then filtered based on whether their
descriptions were longer than a certain mean length.

• Deja Images Dataset (Chen et al., 2015) consists of 180K unique
user-generated captions associated with 4M Flickr images, where one
caption is aligned with multiple images. This dataset was collected by
querying Flickr for 693 high-frequency nouns; the captions were then
further filtered to have at least one verb and to be judged as “good”
captions by workers on Amazon’s Mechanical Turk (Turkers).

3.1.2 Crowd-sourced Captions 

• UIUC Pascal Dataset (Farhadi et al., 2010) is probably one of the
first datasets aligning images with captions. The Pascal dataset
contains 1,000 images with 5 sentences per image.

• Flickr 30K Images (Young et al., 2014) extends previous Flickr
datasets (Rashtchian et al., 2010), and includes 158,915 crowd-sourced
captions that describe 31,783 images of people involved in everyday
activities and events.

• Microsoft COCO Dataset (MS COCO) (Lin et al., 2014) includes complex
everyday scenes with common objects in naturally occurring contexts.
Objects in the scene are labeled using per-instance segmentations. In
total, this dataset contains photos of 91 basic object types with 2.5
million labeled instances in 328k images, each paired with 5 captions.
This dataset gave rise to the CVPR 2015 image captioning challenge and
is continuing to be a benchmark for comparing various aspects of
vision and language research.

• Abstract Scenes Dataset (Clipart) (Zitnick et al., 2013) was created
with the goal of representing real-world scenes with clipart to study
scene semantics isolated from object recognition and segmentation
issues in image processing. This removes the burden of low-level
vision tasks. This dataset contains 10,020 images of children playing
outdoors, associated with a total of 60,396 descriptions.

3.1.3 Captions of Densely Labeled Images 

Existing caption datasets provide images paired with captions, but
such brief image descriptions capture only a subset of the content in
each image. Measuring the magnitude of the reporting bias inherent in
such descriptions helps us to understand the discrepancy between what
we can learn for the specific task of image captioning versus what we
can learn more generally from the photographs people take. One dataset
useful to this end provides image annotation for content selection:

• Microsoft Research Dense Visual Annotation Corpus (Yatskar et al.,
2014) provides a set of 500 images from the Flickr 8K dataset
(Rashtchian et al., 2010) that are densely labeled with 100,000
textual labels, with bounding boxes and facets annotated for each
object. This approximates “gold standard” visual recognition.

To get a rough estimate of the reporting bias in image captioning, we
determined the percentage of top-level objects^3 that are mentioned in
the captions for this dataset out of all the objects that are
annotated. Of the average 8.04 available top-level objects in the
image, each of the captions only reports an average of 2.7 of these
objects.^4 A more detailed analysis of reporting bias is beyond the
scope of this paper, but we found that many of the biases (e.g.,
people selection) found with abstract scenes (Zitnick et al., 2013)
are also present with photos.
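
The figures above imply that a typical caption mentions roughly one
third of the annotated top-level objects (2.7 / 8.04 is approximately
0.34). A minimal sketch of how such a mention rate could be computed
is given below. The data layout (one set of paraphrase labels per
object, tokenized captions) and the function name mention_rate are
hypothetical, and matching is restricted to single-word labels for
simplicity.

```python
def mention_rate(images):
    """Fraction of annotated top-level objects mentioned per caption.

    images: list of dicts with
        "objects"  : list of sets; each set holds the lowercase labels
                     (paraphrases) that annotators gave one object
        "captions" : list of token lists, one per caption
    """
    mentioned = 0
    available = 0
    for image in images:
        for caption in image["captions"]:
            words = {tok.lower() for tok in caption}
            available += len(image["objects"])
            # An object counts as mentioned if any of its labels
            # appears as a word of the caption.
            mentioned += sum(1 for labels in image["objects"] if labels & words)
    return mentioned / available


if __name__ == "__main__":
    toy = [{
        "objects": [{"dog", "animal"}, {"ball"}, {"grass", "lawn"}],
        "captions": [["A", "dog", "plays"], ["dog", "with", "a", "ball"]],
    }]
    print(mention_rate(toy))
```

Using the labels supplied with each object as the paraphrase set
mirrors the matching strategy described in the footnote, without
relying on an external synonym resource.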

3.2 Video Description and Instruction 

Video datasets aligned with descriptions (Chen et al., 2010; Rohrbach
et al., 2012; Regneri et al., 2013; Naim et al., 2015; Malmaud et al.,
2015) generally represent limited domains and small lexicons, largely
because video processing and understanding is a very compute-intensive
task. Available datasets include:

• Short Videos Described with Sentences (Yu and Siskind, 2013)
includes 61 video clips (each 3-5 seconds in length, filmed in three
different outdoor environments), showing multiple simultaneous events
between a subset of four objects: a person, a backpack, a chair, and a
trash-can. Each video was manually annotated (with very restricted
grammar and lexicon) with several sentences describing what occurs in
the video.

^3 This visual annotation consists of a two-level hierarchy, where
multiple Turkers enumerated and located objects and stuff in each
image, and these objects were then further labeled with finer-grained
object information (Has attributes).

^4 We did not use an external synonym or paraphrasing resource to
perform the matching between labels and captions, as the dataset
itself provides paraphrases for each object: each object is labeled by
multiple Turkers, who labeled isA relations (e.g., “eagle” is a
“bird”).

Group          Dataset    Img (k)  Txt (k)  Frazier  Yngve  Vocab Size (k)  Sent Len.   #Conc  #Abs    %Abs  Ppl  (A)bs/(R)eal  BB
Balanced       Brown            -       52    18.5   77.21            47.7      20.82   40411  7264  15.24%  194  -             -
User-Gen       SBU           1000     1000    9.70   26.03           254.6      13.29  243940  9495   3.74%  346  R             -
User-Gen       Deja          4000      180    4.13    4.71            38.3       4.10   34581  3714   9.70%  184  R             -
Crowd-sourced  Pascal           1        5    8.03   25.78             3.4      10.78    2741   591  17.74%  123  R             -
Crowd-sourced  Flickr30K       32      159    9.50   27.00            20.3      12.98   17214  3033  14.98%  118  R             -
Crowd-sourced  COCO           328     2500    9.11   24.92            24.9      11.30   21607  3218  12.96%  121  R             Y
Crowd-sourced  Clipart         10       60    6.50   12.24             2.7       7.18    2202   482  17.96%  126  A             Y
Video          VDC              2       85    6.71   15.18            13.6       7.97   11795  1741  12.86%  148  R             -
Beyond         VQA             10      330    6.50   14.00             6.2       7.58    5019  1194  19.22%  113  A/R           -
Beyond         CQA            123      118    9.69   11.18            10.2       8.65    8501  1636  16.14%  199  R             Y
Beyond         VML             11      360    6.83   12.72            11.2       7.56    9220  1914  17.19%  110  R             Y

Table 1: Summary of statistics and quality metrics of a sample set of
major datasets. For Brown, we report Frazier and Yngve scores on
automatically acquired parses, but we also compute them for the 24K
sentences with gold parses: in this setting, the mean Frazier score is
15.26 while the mean Yngve score is 58.48.

• Microsoft Research Video Description Corpus (MS VDC) (Chen and
Dolan, 2011) contains parallel descriptions (85,550 English ones) of
2,089 short video snippets (10-25 seconds long). The descriptions are
one-sentence summaries about the actions or events in the video as
described by Amazon Turkers. In this dataset, both paraphrase and
bilingual alternatives are captured; hence, the dataset can be useful
for translation, paraphrasing, and video description purposes.

3.3 Beyond Visual Description 

Recent work has demonstrated that n-gram language modeling paired with
scene-level understanding of an image trained on large enough datasets
can result in reasonable automatically generated captions (Fang et
al., 2014; Donahue et al., 2014). Some works have proposed to step
beyond description generation, towards deeper AI tasks such as
question answering (Ren et al., 2015; Malinowski and Fritz, 2014). We
present three of these attempts below:

• Visual Madlibs Dataset (VML) (Yu et al., 2015) is a subset of 10,783
images from the MS COCO dataset which aims to go beyond describing
which objects are in the image. For a given image, three Amazon
Turkers were prompted to complete one of 12 fill-in-the-blank template
questions, such as ‘when I look at this picture, I feel ____’,
selected automatically based on the image content. This dataset
contains a total of 360,001 MadLib questions and answers.

• Visual Question Answering (VQA) Dataset (Antol et al., 2015) is
created for the task of open-ended VQA, where a system can be
presented with an image and a free-form natural-language question
(e.g., ‘how many people are in the photo?’), and should be able to
answer the question. This dataset contains both real images and
abstract scenes, paired with questions and answers. The real images
comprise 123,285 images from the MS COCO dataset; the abstract scenes
comprise 10,000 clip-art scenes, made up from 20 ‘paperdoll’ human
models with adjustable limbs and over 100 objects and 31 animals.
Amazon Turkers were prompted to create ‘interesting’ questions,
resulting in 215,150 questions and 430,920 answers.

• Toronto COCO-QA Dataset (CQA) (Ren et al., 2015) is also a visual
question answering dataset, where the questions are automatically
generated from image captions of the MS COCO dataset. This dataset has
a total of 123,287 images with 117,684 questions, each with a one-word
answer about objects, numbers, colors, or locations.

4 Analysis 

Test \ Train  Brown  Clipart   Coco  Flickr30K      CQA    VDC    VQA  Pascal    SBU
Brown         237.1     99.6  560.8      405.0  354.039  187.3  126.5    47.8  621.5
Clipart       233.6     11.2  117.4      109.4    210.8   82.5  114.7    28.7  130.6
Coco          274.6     59.2   36.2       75.3    137.0   87.1  236.9    39.3  111.0
Flickr30K     247.8     78.5   54.3       37.8    181.5   72.1  192.2    39.9  125.0
CQA           489.4    186.1  137.0      244.5     33.8  259.0   72.1    74.9  200.1
VDC           200.5     52.4   61.5       51.1    289.9   30.0  180.1    28.7  154.5
VQA           425.9    368.8  366.8      665.8    317.7  455.0   19.6   119.3  281.0
Pascal        265.2     64.5   43.2       63.4    174.2   83.0  228.2    36.0  105.3
SBU           473.9    107.1  346.4      344.0    328.5  230.7  194.3    78.2  119.8
#vocab        14.0k     1.1k    13k       9.4k     5.3k   4.9k   1.4k    1.0k  65.1k

Table 2: Perplexities across corpora, where rows represent test sets
(20k sentences) and columns training sets (remaining sentences). To
make perplexities comparable, we used the same vocabulary frequency
cutoff of 3. All models are 5-grams.

Figure 1: Simplified part-of-speech distributions for the eight
datasets. We include the POS tags from the balanced Brown corpus
(Marcus et al., 1999) to contextualize any very shallow syntactic
biases. We mapped all nouns to “N,” all verbs to “V,” all adjectives
to “J” and all other POS tags to “O.”

We analyze the datasets introduced in Section 3 according to the
metrics defined in Section 2, using the Stanford CoreNLP suite to
acquire parses and part-of-speech tags (Manning et al., 2014). We also
include the Brown corpus (Francis and Kucera, 1979; Marcus et al.,
1999) as a reference point. We find evidence that the VQA dataset
captures more abstract concepts than other datasets, with almost 20%
of the words found in our abstract concept resource. The Deja corpus
has the fewest abstract concepts, followed by COCO and VDC. This
reflects differences in collecting the various corpora: For example,
the Deja corpus was collected to find specifically visual phrases that
can be used to describe multiple images. This corpus also has the most
syntactically simple phrases, as measured by both Frazier and Yngve;
this is likely caused by the phrases needing to be general enough to
capture multiple images.
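
For readers unfamiliar with the Yngve measure, one common formulation
assigns each word the number of right siblings accumulated along its
path from the root of the parse tree, so that left-branching
structures receive higher scores. The sketch below implements that
formulation on parse trees encoded as nested tuples. The tree
encoding, the function names, and the use of a per-sentence mean are
illustrative assumptions; the exact aggregation behind the scores in
Table 1 is not specified in this paper.

```python
def yngve_depths(tree, depth=0):
    """Return (word, Yngve depth) pairs for a nested-tuple parse tree.

    A tree is either a string (a word) or a tuple whose first element
    is the node label and whose remaining elements are child trees.
    """
    if isinstance(tree, str):
        return [(tree, depth)]
    children = tree[1:]
    pairs = []
    for index, child in enumerate(children):
        # Branches are numbered from the right starting at 0, so a
        # child adds the number of siblings to its right.
        right_siblings = len(children) - 1 - index
        pairs.extend(yngve_depths(child, depth + right_siblings))
    return pairs


def mean_yngve(tree):
    """Average Yngve depth over the words of one sentence."""
    depths = [d for _, d in yngve_depths(tree)]
    return sum(depths) / len(depths)


if __name__ == "__main__":
    sentence = ("S",
                ("NP", ("DT", "the"), ("NN", "dog")),
                ("VP", ("VBZ", "barks")))
    print(yngve_depths(sentence))
    print(mean_yngve(sentence))
```

Short, flat phrases such as those in the Deja corpus have few
accumulated right siblings under this formulation, which is in line
with the low scores reported for that corpus.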

The most syntactically complex sentences are found in the Flickr30K,
COCO and CQA datasets. However, the CQA dataset suffers from a high
perplexity against a background corpus relative to the other datasets,
at odds with relatively short sentence lengths. This suggests that the
automatic caption-to-question conversion may be creating unexpectedly
complex sentences that are less reflective of general language usage.
In contrast, the COCO and Flickr30K datasets’ relatively high
syntactic complexity is in line with their relatively high sentence
length.
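
The perplexity comparisons discussed in this section, both against a
background corpus and across corpora, can be illustrated with a small
language model. The sketch below is hedged in several ways: it uses an
add-one smoothed bigram model in place of the 5-gram models reported
here (whose smoothing method is not stated), and it interprets the
vocabulary frequency cutoff of 3 as mapping words seen fewer than
three times in training to an unknown token. All function names are
hypothetical.

```python
import math
from collections import Counter

UNK, BOS, EOS = "<unk>", "<s>", "</s>"


def build_vocab(train, cutoff=3):
    """Keep words seen at least `cutoff` times; others become UNK."""
    counts = Counter(tok for sent in train for tok in sent)
    return {w for w, c in counts.items() if c >= cutoff} | {UNK, EOS}


def prepare(sent, vocab):
    """Map out-of-vocabulary words to UNK and add sentence markers."""
    return [BOS] + [w if w in vocab else UNK for w in sent] + [EOS]


def train_bigram(train, vocab):
    """Count contexts and bigrams over the prepared training data."""
    context_counts, bigram_counts = Counter(), Counter()
    for sent in train:
        toks = prepare(sent, vocab)
        for prev, cur in zip(toks, toks[1:]):
            context_counts[prev] += 1
            bigram_counts[prev, cur] += 1
    return context_counts, bigram_counts


def perplexity(test, vocab, context_counts, bigram_counts):
    """Per-token perplexity of `test` under the add-one bigram model."""
    log_prob, n_tokens = 0.0, 0
    size = len(vocab)  # number of tokens that can be predicted
    for sent in test:
        toks = prepare(sent, vocab)
        for prev, cur in zip(toks, toks[1:]):
            p = (bigram_counts[prev, cur] + 1) / (context_counts[prev] + size)
            log_prob += math.log(p)
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)


if __name__ == "__main__":
    train = [["a", "dog", "runs"]] * 3 + [["a", "cat", "sleeps"]] * 3
    vocab = build_vocab(train)
    ctx, big = train_bigram(train, vocab)
    print(perplexity([["a", "dog", "runs"]], vocab, ctx, big))
    print(perplexity([["the", "bird", "flies"]], vocab, ctx, big))
```

Filling one cell of a cross-corpus table then amounts to training on
one corpus and calling perplexity on held-out sentences of another,
with the same cutoff applied to every training set.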

Table 2 illustrates further similarities between datasets, and a more
fine-grained use of perplexity to measure the usefulness of a given
training set for predicting words of a given test set. Some datasets
such as COCO, Flickr30K, and Clipart are generally more useful as
out-of-domain data compared to the QA datasets. Test sets for VQA and
CQA are quite idiosyncratic and yield poor perplexity unless trained
on in-domain data. As shown in Figure 1, the COCO dataset is balanced
across POS tags most similarly to the balanced Brown corpus (Marcus et
al., 1999). The Clipart dataset provides the highest proportion of
verbs, which often correspond to actions/poses in vision research,
while the Flickr30K corpus provides the most nouns, which often
correspond to object/stuff categories in vision research.
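
The simplified part-of-speech distributions of Figure 1 can be
approximated with a mapping like the following. The sketch assumes
Penn Treebank tags (the English tag set produced by Stanford CoreNLP)
and collapses them by prefix into the four classes N, V, J and O; the
exact mapping used for Figure 1 may differ in details such as the
treatment of modals, and the function names are hypothetical.

```python
from collections import Counter


def coarse_tag(tag):
    """Collapse a Penn Treebank tag into N, V, J or O."""
    if tag.startswith("NN"):   # NN, NNS, NNP, NNPS
        return "N"
    if tag.startswith("VB"):   # VB, VBD, VBG, VBN, VBP, VBZ
        return "V"
    if tag.startswith("JJ"):   # JJ, JJR, JJS
        return "J"
    return "O"                 # everything else (assumed to include modals)


def pos_distribution(tagged_tokens):
    """Relative frequency of each coarse class in (word, tag) pairs."""
    counts = Counter(coarse_tag(tag) for _, tag in tagged_tokens)
    total = sum(counts.values())
    return {cls: counts[cls] / total for cls in "NVJO"}


if __name__ == "__main__":
    tagged = [("the", "DT"), ("brown", "JJ"), ("dog", "NN"), ("runs", "VBZ")]
    print(pos_distribution(tagged))
```

Comparing such distributions between a new corpus and a balanced
reference like Brown gives a quick view of whether the corpus leans
towards objects (nouns) or actions (verbs).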

We emphasize here that the distinction between a qualitatively good or
bad dataset is task dependent. Therefore, all these metrics and the
obtained results provide researchers with an objective set of criteria
for deciding whether a dataset is suitable for a particular task.

5 Conclusion 

We detail the recent growth of vision & language corpora and compare
and contrast several recently released large datasets. We argue that
newly introduced corpora can be compared to similar datasets by
measuring perplexity, syntactic complexity, and abstract:concrete word
ratios, among other metrics. By leveraging such metrics and comparing
across corpora, research can be sensitive to how datasets are biased
in different directions, and define new corpora accordingly.